FiD-A123  651  DESIGN  OF  RELATIONAL  DATABASE  BENCHMARKS(U)  NAVAL  1/2 

POSTGRADUATE  SCHOOL  MONTEREV  CA  V  C  STONE  JUN  82 


UNCLASSIFIED 


F/G  9/2 


NL 


MICROCOPY  RESOLUTION  TEST  CHART 

NATIONAL  BUREAU  OF  STANDARDS- 1963-A 


DESIGN  OF  RELATIONAL  DATABASE  BENCHMARKS 


by 

Vincent  Courtney  Stone 
June  1983 


HhesisAdvisoin^^^^^^^DavidK^Hsiao 
Approved  for  public  release;  distribution  i 


SECURITY  CLASSIFICATION  OF  THIS  FAOC  (*hm  Omtm  htM« 

|  WPOrr  DOCUMENTATION  PAGE 

READ  INSTRUCTIONS 

BEFORE  COMPLETING  FORM 

l.  report  number - 

mm 

^3.  RECIPIENT'S  CATALOG  NUMBER 

4.  TITlC  (mmi  SarAMII*) 

Design  of  Relational  Database  Benchmarks 

s.  tyre  of  report  a  period  covered 
Master's  Thesis 

June,  1983 

S.  PERFORMING  ORG.  REPORT  NUMBER 

f.  iuf *6kml  . —  “1 

Vincent  Courtney  Stone 

S.  CONTRACT  OR  GRANT  NUMBERfa) 

1.  performing  organization  namk  ano  aooriss 

Naval  Postgraduate  School 

Monterey,  California  93940 

10.  PROGRAM  ELEMENT.  PROJECT.  TASK 

AREA  A  WORK  UNIT  NUMBERS 

11.  CONTROL  LINS  OFFICS  NAM*  ANO  AOOAKSS 

Naval  Postgraduate  School 

Monterey,  California  93940 

12.  report  OATE 

June  1983 

IS.  NUMBER  OF  PAGES 

112 

irMhlTWil  aAcNCY  NANS  A  AOONlssrir  cAml  trmm  CmmUMI  Off  1  cm) 

IS.  SECURITY  CLASS,  (ml  tltlm  report) 

UNCLASSIFIED 

Approved  for  public  release;  distribution  unlimited 


17.  DISTRIBUTION  STATEMENT  (ml  mm  mkmfmmt  mmlmtmi  In  Mmmk  M.  II  Mlmrml  kmmt  ftaport) 


i*.  supplementary  notes 


1*.  K(Y  NKOI  (CmmUmmm  mm  immmrmm  mMm  II  mmmmmmmt  mi  MmHlY  Ar  Mmmk  »■>«! 

benchmarking,  database  machines 


m  iBtiiCt  fCwM—  —  tmmmmm  Mim  II  mmmmmmm wf  mmi  limmu tr  Ar  Mmmk  n— iirj 

Performance  measurements  of  a  database  machine  reflect  not  only 
the  processing  power  of  the  machine,  but  also  the  size  and 
structure  of  the  database.  It  is  therefore  useful  to  construct 
databases  for  performance  measurements  of  database  machines. 
Furthermore,  it  is  useful  to  utilize  synthetic  data,  such  that 
the  volume  of  the  reply  can  be  predicted  for  a  given  query  and  the 
structure  and  attributes  of  the  database  can  be  varied  (Continued) 

00  t  jam*ts  MTS  coition  or  i  nov  «•  is  obsolete 


$/M  0102*  LF-014-AA01 


^  SICUNITV  CLASSIFICATION  OP  THIS  PAGE  Data  tmlmne 


MCUMTV  CLAMirtCATtOM 


ABSTRACT  (Continued)  Block  #  20 

for  lncended  test  queries.  Conducting  measurement  studies  using 
a  synthetic  database  contributes  to  the  generality  of  the  results 
when  different  test  queries  are  employed.  A  parameterized  progra 
is  described  herein  which  can  be  used  to  generate  various  relat¬ 
ions  for  a  synthetic  database.  The  experiences  in  constructing 
and  using  the  database  generator  are  described.  It  is  suggested 
that  given  sufficient  information  on  real-world  databases  the 
generator  may  be  useful  for  modeling  them  as  well  as  for  creating 
databases  for  benchmark  tests. 


Accession  F«r 

NT  IS  <T  :  V  - 
TV'  7  i . 

U:1  .  "  : 


Approved  for  public  release,  distribution  unlimited. 


Autnor : 


Approve! 


Design  of 

Relational  Database  Pencnmaris 
by 

Vincent  Courtney  Stone 
Lieutenant  Commander,  United  States  Navy 
3.S.,  United  States  Naval  Academy,  li?4 


Submitted  in  partial  fulfillment  of  tne 
requirements  for  tne  degree  of 

MASTER  OF  SCIENCE  IN  COMPUTER  SCIENCE 

from  tne 

NAVAL  POSTGRADUATE  SCHOOL 
June,  iy«3 


_ _ _ _ 

by  :  _ /0-QuaajP(  (C. _ ^ _ 

Tflesis  Advisor' 

Second  Reale r 

t  • _ _ 

’Cnal rma n 7~D e pa r t men t -o f  Compute r~S  olence' 

_ t,  _ 

Dean  of  Information  aiTi^t;  ^ y  Sciences 


ABSTRACT 


Performance  measurements  of  a  database  macnir.e  reflect 
not  only  tne  processing  power  of  tne  mackine,  cut  aiso  tne 
size  and  structure  of  tne  database.  It  is  tnerefore  useful 
to  construct  databases  for  performance  measurements  of  data¬ 
base  machines.  Furtnermore,  it  is  useful  to  utilize 
syntnetic  data,  sucn  tnat  tne  volume  of  tne  reply  can  be 
predicted  for  a  riven  query  and  tne  structure  and  attributes 
of  tne  database  can  ee  varied  for  intended  test  ^ueri^s. 
Conducting  measurement  studies  usinr  a  syntnetic  database 
contributes  to  tne  generality  of  tne  results  wnen  different 
test  queries  are  employed.  A  parameterized  program  is  de¬ 
scribed  nerein  wnicn  can  be  used  to  generate  various 
relations  for  a  syntnetic  database.  Tne  experiences  in  con¬ 
structing  and  using  tne  database  generator  are  described  It 
is  suggested  tnat  given  sufficient  information  on  real-world 
databases  tne  venerator  may  be  useful  for  modeling  tnem  as 
well  as  for  creating  databases  for  oencnmarK  tests. 
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I.  BENCHMARKS  £0g,  DATABASE  ^ACftlNEg. 

A.  PERFORMANCE  MEASUREMENTS 

In  comparing  database  management  systems  fTBvss)  an 
important  factor  is  tneir  performance.  One  way  to  compare 
DBMSs  is  to  run  specific  applications  under  a  variety  of 
systems.  Sacn  system  can  be  'fine~tunea'  to  give  tne  oest 
result.  An  evaluation  cased  on  sucn  a  metnod  is  costly  and 
time-consuming.  Often  such  a  metnod  may  be  infeasible.  In 
many  cases,  a  dataoase  for  tne  specific  applications  may 
not  even  exist.  As  a  second  metnod,  an  evaluation  could  oe 
made  on  tne  Dasis  of  performance  measurements  of  existing 
datatases.  Ttus  metnod  is  less  costly  and  less  time-consu¬ 
ming.  However,  tne  following  questions  arise.  Is  tne 
existing  dataoase  sufficient  to  support  intended  applica¬ 
tions?  Are  tne  applications  good  ror  condu  ting  relative 
performance  evaluation  of  different  DBMSs? 

It  is  impractical  to  perform  sucn  direct  comparison  of 
DBMSs.  Adapting  an  application  to  several  systems  for  eval¬ 
uation  purposes  is  not  practical.  Evaluation  Dased  on 
existing  databases  is  subject  tc  interpretation  error.  Tne 
increasing  number  of  DBMSs  ma«ces  it  imperative  tnat  some 
metnod  is  to  be  devised  to  do  comparative  performance 
measurements . 
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B,  BENCHMARKING 


Tne  concept  of  a  standard  for  measuring  performance  is 
not  new.  Tne  standard  is  usually  inown  as  a  benchmark* 
after  tne  martcers  used  by  surveyors  in  establ  suing  a  common 
reference  point  for  tneir  measurements.  For  example*  Mount 
Diablo  (a  mountain  east  of  San  Francisco)  is  used  as  tne 
reference  point  in  surveying  mucn  of  Nortnern  California  due 
to  its  long-ran^e  visibility.  A  metnod  for  measuring- 
similar  items  in  reference  to  a  standard  is  called 
benchmarking. 

Precedents  for  bencnmariin*  exist  in  measuring  tne  per¬ 
formance  of  computer  systems.  Tne  Gibsor-Mix  method 
measures  the  execution  time  of  a  specific  set  of  application 
programs  for  benchmarking  computer  systems.  Tne  expected 
performance  of  a  system  could  be  computed  by  cnaracteri zing 
tne  expected  workload  as  a  mix  of  Jobs  from  the  standard 
set. 

It  is  proposed  tnat  a  set  of  application  programs  can  oe 
devised  to  measure  tne  performance  of  DBMSs .  Using  tnese 
benchmark  measurements*  it  will  oe  possible  to  compare  tne 
performance  of  various  i/BMSs.  Tne  measurements  can  oe  ana¬ 
lyzed  to  su?«?est  stren?tns  and  weaknesses  of  tne  DBvSs. 

C.  QUANTITIES  TO  3E  MEASURED 

Tne  generally  accepted  performance  index  for  a  DBVS  Is 
tne  response  time.  Defining  tne  response  time  as  tne 


primary  performance  index  is  tne  scope  of  tnis  research. 
However,  tne  response  time  is  based  on  several  factors. 
Among  tnese  factors  are  tne  tine  to  process  tne  auery,  tne 
time  to  access  tne  lata,  tne  time  to  process  lata,  am  tne 
time  to  return  tne  lata.  For  a  DBMS  running  on  a  mainframe 
computer,  tne  effects  of  otner  woncloaa  on  tne  response 
time  must  also  be  considered. 

A  measurement  of  tne  response  time  is  more  significant 
wnen  measurements  of  its  components  are  provided.  Some 
simplifying  assumptions  may  be  male.  Tfte  first  sum  assump¬ 
tion  is  tbat  tne  rate  of  accessing  lata  in  tne  database  is 
constant.  Tne  second  is  tnat  tne  rate  of  returning  pro¬ 
cessed  lata  is  constant.  However,  tne  time  involvei  in  tne 
processing  of  queries  and  tne  time  involved  in  tne  proces¬ 
sing  of  data  may  vary  greatly  among  database  operations.  It 
order  to  recori  the  variance  of  time  among  tne  operations, 
tests  must  be  devised  wnicn  will  indicate  tnese  components 
for  all  supported  operations. 

This  tnesis  focuses  on  measurements  of  tne  response 
time.  A  development  of  a  system  to  measure  components  cf 
tne  response  time  is  discussed.  Tne  system  involves  tne 
generation  of  a  synthetic  database.  The  system  also  mea¬ 
sures  the  bencnmartced  machine  in  using  tnat  database. 
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BENCHMARKING  RKL&XION£L  DATABASE  MARINES 

A.  THE  BENCHMARKING  ENVIRONMENT 

The  researcn  done  in  support  or  tnis  tnesis  oas  oeen 
performed  in  a  complex  environment.  Tne  complexity  involves 
multiple  macnines  and  multiple  operating  systems. 

A  Relation  Generator  (RG)  or  synthetic  relations  nas 
oeen  developed  using  Pascal  (i.e.,  IBM's  Pascai/VS)  in  a 
multiuser  environment  ( 7M/CMS  running  on  IBM  3032).  RG  is 
used  in  a  oaten  environment  ( MVS )  on  tne  same  macnine.  Tne 
relations  are  generated  in  EBCDIC-cnaracter  form.  Tnev  are 
transported  to  a  UNI7AC  1100  via  tape.  Tne  EBCDIC  riles  are 
tnen  loaded  onto  tne  nost  (i.e.,  tne  GNIVAC  computer)  and 
translated  by  tne  host  into  ASCII  riles.  Tnese  ASCII  riles 
are  finally  leaded  into  a  tacteni  dataoase  macnine  (i.e., 
Britton  Lee's  I  DM  500) . 

Tne  oactcend  macnine  and  Interface  software  for  tne  1100 
series  computers  are  marketed  oy  tne  Amperir  Corporation  or 
Cnatswortn,  California,  as  tne  RDM  1100.  Additional  mea¬ 
surements  can  oe  made  hy  bypassing  tne  part  or  tne  query 
processor  tnat  provides  terminal  support.  Tnis  is  accom- 
plisned  by  communicating  lirectly  with  tne  ouery  processor 
yla  complied  language  statements  (i.e.,  COBOL).  Tnis  does 
not  completely  bypass  tne  query  processor,  because  tne  query 
language  is  interpreted  and  cannot  oe  precompiled.  However, 


tne  results  snow  tnat  query  processing  does  not  represent  a 
significant  portion  of  tne  response  time  if  tne  cost  *or«t- 
load  is  Ugnt.  Tne  terminal  naniler  represents  also  a  small 
portion  of  tne  response  time.  Tnerefore,  tne  only  advantage 
to  tne  use  of  compiled  programs  is  tne  option  of  running  tne 
process  as  a  background  jod. 

B.  THE  ARCHITECTURE  OF  THE  SI STEM 

Tne  arcnitecture  of  tne  system  encompasses  two  major 
areas.  Tne  first  of  tnese  areas  is  tne  internal  arcnitec¬ 
ture  of  tne  I  DM  500.  Tne  second  area  is  me  nost  system 
software,  i.e.,  tne  user  interface  wnicn  runs  on  tne  nost. 

1.  Tne  Basic  Macnipe  Arcnl teciure  a^d  Various  Coifignz 
rations 

Tne  IDM  500  is  made  up  of  several  modules  connected 
to  a  common  ni*n-speed  dus  (See  Figure  1).  Tne  database 
processor  is  a  6-mnz ,  Zilog  Z -£000  series  microprocessor 
wnicn  performs  tne  DB^S  functions.  Tne  coding  for  tne 
microprocessor  is  written  largely  in  tne  C  programming  lan¬ 
guage,  along  witn  some  assembly  language  routines.  It 
comprises  about  330  t-bytes  of  machine  code.  An  optional 
module,  tne  database  accelerator  improves  tne  system  perfor¬ 
mance  by  implementing  in  nign-speed,  special -purpose 
nardware  some  of  tne  DBMS  functions  normally  performed  dv 
tne  database  processor. 
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Tne  cacne  memory  is  composed  of  64i-oit  dynamic  HAV 
cnips.  Tne  basic  configuration  (at  tne  beginning  of  tne 
tests)  includel  one-naif  megabyte  of  memory.  Up  to  six 
megabytes  of  memory  can  oe  supported.  During  tne  testing 
period,  configurations  of  one  and  two  megabytes  nave  also 
been  usel. 

One  to  four  ii sg  controllers  may  be  installed. 
?acn  controller  supports  up  to  four  six-nuno red-mega  tyte , 
nari  dlsics.  A  controller  may  oe  installed  to  faom- 

tate  oaciclng  up  and  loading  data. 

Two  standard  nost  interfaces  are  available.  A 
I  SEE— 48P  byte-wide  parallel  interface  is  available  for  con¬ 
nection  to  mainframes  and  minicomputers.  A  second  interface 
can  be  used  to  provide  multiple  P.S-232  serial  ports  to 
microcomputers.  A  special  byte/word  interface  for  com¬ 
munication  witn  UN  I VAC  nost  computers  is  supplied  by  tne 
Amperlf  Corporation. 

2.  Tne  Database  Organization 

Tne  ID'I  50JS  software  supports  tne  relational  data¬ 
base  model.  Data  Is  stored  on  tne  diss  in  two  logical 
levels.  Tnese  levels  are  tne  system  database  and  tne  user 
databases.  At  tne  top  level,  tne  system  database  contains 
five  system  tables  and  tnirteen  database  tables.  Tne  system 
tables  contain  Information  on  nardware  configuration,  data¬ 
bases  and  current  usase.  Tne  tnirteen  database  tables 
comprise  tne  data  dictionary.  Tney  are  used  to 
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attri butes. 


users  ,  ana 


store  information  about  relations, 
security.  A  list  of  tns  system  taoies  and  tne  lataDase 
tables  is  given  in  Appendix  A. 

Although  access  to  tne  system  database  is  requires 
for  tne  creation  of  a  user  database,  an  existing  user  data¬ 
base  can  be  accessed  directly,  i.e.,  without  *oing  tnrougn 
tne  system  database.  Eacn  user  database  nas  cotn  database 
tables  and  user  taoies.  Tne  database  taDies  are  stored 
within  tne  user  database  and  may  be  accessed  in  tne  same 
manner  as  user  tables. 

The  basic  unit  of  lisic  access  is  a  2ic-byte  oiock. 
When  a  database  is  created,  a  space  allocation  in  blocks  may 
be  requested.  Tnis  allocation  may  be  increased  if  neces¬ 
sary.  Both  system  tables  and  database  taoies  are  used  by 
tne  system  to  compute  physical  addresses. 

3.  Tne  U^er  Interface 

Tne  user  interface  is  accessed  oy  invoicing  an  pro¬ 
cess  on  tne  host.  Tnis  process  is  an  interactive  query 
processor.  Tne  query  processor  parses  tne  user's  queries 
written  in  tne  Relational  Query  Language  (P.QL).  RCL  is 
Amperlf's  implementation  of  Bri ttor.-Lee 's  Intelligent  Query 
Language  (I0L).  Alternatively,  queries  may  ce  submitted  to 
tne  query  processor  from  a  compiled  COBOL  or  FORTRAN  pro¬ 
gram.  Submitting  a  compiled  program  as  a  oaten  Job,  tne 
user  may  bypass  the  query  processor's  terminal  banner. 
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However,  tne  batch  job  still  depends  on  tne  query  processor 
for  parsing  of  tne  query. 

The,  Relational  Query  Language  (RQL)  orovides  opera¬ 
tions  and  facilities  similar  to  tnose  available  on 
relational  Di'ISs  currently  running  on  mainframe  computers 
and  larger  minicomputers.  RQL  also  allows  queries  to  oe 
pre-parsel  and  stored  within  a  database.  These  stored  com¬ 
mands  limit  tne  time  required  in  tne  nost  for  parsing  and 
reduce  tne  time  required  in  the  Dacfcend  tor  tne  dataoase- 
table  looicup .  Additional  information  on  RQL  nay  be  found  in 
ft .  2  and  3J . 

Communication  with  the  ID*!  is  via  a  system  process, 
RDmlO.  RDMIO  supervises  communications  between  user  proces¬ 
ses  running  on  tne  nost  and  the  nardware  interface  to  tne 
IDm  (See  Figure  2).  Op  to  ten  users  may  access  tne  REV 
simultaneously  from  a  single  UNIVAC  nost. 


Figure  2  -  The  IDM/User  Interface 


III.  THE  BENCHMARKING  APPROACH 


A.  A  MULTI-DIMENSIONAL  PROELEM 

Creating  a  bencnmaricing  system  poses  a  problem  witn 
several  dimensions.  Tne  problem  can  ne  broken  down  into  two 
major  areas.  Tnese  areas  are  modeling  and  measurement. 

1 •  Modeling  Pro  biem« 

Tne  modeling  problems  can  be  categorized  as  DBMS- 
dependent  and  database-dependent.  Tne  DBMS-dependent 
modeling  problems  are  related  to  DBMS  scnema  and  syntax. 
Tne  database-dependent  problems  are  related  to  tne  chara — 
terlstlcs  of  tne  database  and  tne  application  to  oe  modeled. 

a.  DBMS-dependent  Problems 

Tne  three  widely  .tnown  database  models  are  tne 
nierarcnical ,  tne  networK,  and  tne  relational.  It  ras  been 
shown  that  databases  and  applications  based  on  one  of  tnese 
models  can  be  translated  to  any  other  model.  However,  tners 
is  no  accepted  oasis  for  meanineful  comparisons  of  tneir 
performance  measurement.  As  a  first  step,  tests  nave  been 
performed  in  support  for  estaoiisning  sucn  a  basis  for  DBvSs 
navin*  tne  same  underlying1  model,  specifically  tne  relation¬ 
al  model. 

b.  Database-dependent  Problems 

Tne  database-dependent  problems  are  representa¬ 
tive  of  existing  databases  and  tne  applications  wfiicn  are 
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used  on  men.  Existing  databases  vary  In  tne  conniexity  and 
In  tne  efficiency  in  which  tney  nave  been  implemented. 
Tnese  varieties  are  partly  due  to  tne  pnysicai  aata  tnat  are 
represented  in  tne  database  and  partly  due  to  tne  program¬ 
mers'  abilities  to  construct  tne  database.  Additionally, 
tne  applications  which  use  tnese  databases  also  model  tne 
pnysicai  data  represented  as  veil  as  tne  information  re¬ 
quired  of  tne  database.  Thus,  botn  existing  databases  and 
applications  must  be  modeled.  Tne  tcey  to  an  effective  and 
general  model  is  creating  one  wnicn  represents  common  cha¬ 
racteristics.  Tne  characteristics  of  databases  and 
applications  must  be  carwfully  studied  prior  to  tne  design 
of  a  general  and  effective  model.  Tne  contrasting  nature  of 
existing  databases  and  tneir  applications  present  an  ex¬ 
tremely  complex  modeling  problem. 

2 •  Measurement  problems 

DBMS  benchmark  measurements,  as  a  standard,  may  also 
represent  a  comparison  or  DBMS  performance.  This  standard 
may  be  eitner  absolute  or  relative.  Absolute  measurements 
assume  a  fixed  standard.  Relative  measurements  may  provide 
ranxings  within  a  group  of  DBMSs.  Tne  measurement  of  tne 
response  time  for  relative  ranxing  is  our  aroai. 

Experiments  must  be  constructed  carefully  and  tne 
environment  must  oe  controlled  to  provide  useable,  accurate 


measurements 


For  example,  in  performing  research  tor  this 


tnesis  it  nas  been  noticed  tnat  tne  load  on  tne  nost  can 
significantly  affect  tne  response  time  as  seen  by  tne  user. 
Similarly,  tne  response  time  is  neavily  affected  oy  tne  time 
required  to  return  tne  data  to  tne  user  at  tne  screen. 
Tnese  effects  must  be  minimized  in  order  to  ootair  measure¬ 
ments  vnicn  more  accurately  reflect  tne  performance  of  tne 
bacKend  database  macnine.  Resolution  of  measurement  pro¬ 
blems  is  discussed  in  Section  7.3. 

B.  RESOLVING  THE  MODELING  PROBLEMS 

Altnougn  tne  modeling  problems  cannot  re  eliminated, 
steps  can  be  tairen  to  minimize  tne  errors  introduced  by  tne 
modeling  process. 

1  •  DBttS-de pendent  Problems. 

Two  assumptions  can  be  made  to  minimize  tne  DE^S- 
iependent  modeling  errors.  Tnese  assumptions  concern  tne 
format  of  tne  data  and  tne  operations  used  to  access  tne 
data . 

Tbs  first  assumption  is  tnat  all  relations  are 
stored  in  tnlrd  normal  form  (3NF).  Tne  use  of  2NE  minimizes 
tne  possibility  of  inconsistent  lata.  <nile  real  databases 
do  not  use  3NF,  this  fact  doesn't  discourage  cur  assumpticn. 
Tne  bencftmarie  is  designed  to  provide  a  measurement  of  Zt'-'Ss' 
performance.  It  is  not  intended  to  tase  into  consideration 
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the  acuities  of  tnose  persons  wno  win  design  tne  databases 
(altbou*n  ease  of  use  may  De  a  consideration  in  sore 
instances),  for  tney  may  not  understand  tne  tneory  cf  ^[vF. 

Tne  second  assumption  to  ce  made  is  tnat  tne  query 
languages  used  oy  tne  DBMSs  are  logically  equivalent.  Ai- 
tnougn  differences  in  syntax  do  exist,  tney  generally  do  not 
affect  tne  breaitn  of  available  operations.  Therefore,  * 
common  set  of  queries  can  ce  implemented  in  tne  DBMSs' 
iaiiviiual  syntaxes  and  provide  tne  identical  logical  re¬ 
sult.  Any  variations  to  tnis  snould  be  noted  witn  oencnrrarx 
results.  Tne  basic  set  of  experiments  include  selections, 
projections.  Joins,  updates,  insertions  and  leietlons.  Ad¬ 
ditionally  experiments  snould  ce  performed  wmcn  test  tne 
performance  of  any  peculiar  or  powerful  operations  wnicn  a 
DBMS  may  nave  in  addition  to  tne  standard  set. 


2 •  Database-dependent  Fro  Diems 

Tne  elimination  of  database-dependent  modeling  pro¬ 
blems  involves  two  fundamental  areas.  Tne  first  of  tnese 
areas  is  tne  generation  of  a  syntnetic  database.  Tne  gene¬ 
ration  of  sucn  a  database  allows  tne  use  of  lata  »nicn  is 
generally  representative  of  existln  databases,  but  net 
specifically  representative  of  any  one.  Tne  lesion  of  tne 
syntnetic  database's  cnaracteristics  snould  De  broad.  Tnis 
ensures  tnat  it  can  oe  adapted  to  realistically  measure  tne 


performance  of  a  database  witn  its  own  cnaracteristics. 
These  characteristics  include  tne  sizes  of  tne  relations  (in 


tns  number  of  tuples  and  tuple  iengtn)  in  tne  database  am 
tne  lengtn  of  a  tuple  relative  to  biocx  size  of  tne  storage 
medium . 

Tne  second  area  involving  database  dependency 
involves  tne  applications  running  on  tne  database.  a  syn¬ 
tactic  worKload  is  reouirel  for  tne  same  reasons  as  for  tne 
syntnetic  database.  Tne  design  of  tne  syntnetic  workload 
snoull  be  broad  enou*n  to  provide  enough  results  to  be  acie 
to  fully  simulate  different  applications.  Tne  workload  is 
designed  witn  two  major  considerations.  Tne  first  conside¬ 
ration  is  support  of  tne  basic  relational  operations 
iiscussea  previously.  An  additional  consideration  tames 
into  account  tne  varying  access  patterns  of  existing  data¬ 
bases.  For  example*  a  given  application  may  repeatedly 
retrieve  only  one  tuple  at  a  time.  Anotner  will  retrieve 
many  in  one  operation.  An  important  cnaracteristic  is  tne 
locality  of  tne  data  retrieved  by  operations.  Tais  cnarac¬ 
teristic  may  produce  different  levels  of  performance  witn 
different  indexing  metnods. 

C.  THE  SYNTHESIZED  DATABASE  AND  WORKLOAD 
1 .  Jne  tJ se  of  Syn^pcsi  zeq  D^t a 

In  determining  a  set  of  bencamarfc  measurements,  it 
is  necessary  to  obtain  tne  set  vnicn  car.  be  used  on  a  wiae 
range  of  DBMSs.  It  is  also  important  tnat  tnis  set  loos  not 
favor  any  DBMS  or  class  of  DBMSs. 
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Two  approaches  could  nave  been  taken  in  octainia? 
measurements.  One  approacn  would  ce  to  perform  tests  on 
existing  databases.  The  other  approach  is  to  do  "-easure- 
ments  on  a  synthetic  database.  The  latter  allows  the 
greatest  flexioility  in  performing  operations  or.  tne  data¬ 
base.  This  is  oecause  tne  scnema  of  a  real  database  might 
not  provide  a  suitable  structure  for  performing  a  test  of 
some  operations.  Tne  scnema  of  a  syntnetic  database,  on  tne 
otner  nand  minimizes  any  bias  resulting  from  designing  tne 
tests  around  a  particular  database. 

The  researcn  for  this  thesis  is  performed  in  con¬ 
junction  with  evaluation  of  relational  database  machines. 
Eowever,  tne  installation  nas  no  relational  database^. 
Therefore,  any  tests  on  tne  DBMS  would  nave  to  be  performed 
on  either  a  synthetic  database  or  a  database  converted  from 
another  model.  Since  tne  use  of  synthetic  databases  sup¬ 
ports  a  more  general  approach  In  benchmarking,  tne  choice 
nas  been  made  to  generate  suca  databases  for  bencnmarkine 
tests . 

2 .  Types  or  Synthesized  Daia 

Synthesized  data  snould  nave  one  major  characterist¬ 
ic.  Tne  types  of  data  snould  be  oroai  enougn  to  test  tne 
supported  DBMS  operations  of  different  tyoes  of  Melds 
(i.e.,  values).  B or  example.  In  tne  research  performed  for 


this  thesis,  tne  first  two  attribute  values  of  each  relation 


nave  tne  same  numeric  value.  However,  tne  first  attribute 
value  is  store!  as  an  integer  and  tne  second  as  a  ~r.aracter 
string.  One  set  of  tests  selects  tuples  Dasea  on  tne  inte¬ 
ger  values;  a  second  set  of  tests  selects  tne  sar'e  tuples 
oased  on  cnaracter  values.  Response  times  may  oe  affected 
by  tne  processing  differences  related  to  tne  data  types. 
Additional  differences  may  result  from  tne  time  required  tc 
format  tne  data  for  output. 

3.  general  Scnema  of  Syn^ne sized  (J^e£ 

Tne  syntneslzed  data  used  for  tnls  tnesis  nas  four 
sets  of  relations.  Eacn  set  nas  several  relations  witn 
different  numbers  of  tuples.  Sacn  relation  in  a  set  nas  tne 
same  attributes.  Tne  attributes  are  similar  among  tne  four 
sets,  differing  only  in  number  and  len*tn  in  order  to  pro¬ 
vide  a  range  of  tuple  iengtns.  Table  1  snows  tne  range  or 
tuple  cnaracterist ics . 

Tne  relations  are  stored  in  several  databases.  Two 
databases  are  used  for  testing  single-relation  operations. 
Tne  first  database  contains  all  of  tne  relations  uses  in 
single  relation  testing.  Tne  second  dataoase  contains  rela¬ 
tions  wnose  tuples  are  of  100  bytes  and  200  bytes.  Tnis 
database  uses  compressed  fields  for  strings  (i.e.,  trailing 
oianlcs  are  dropped)  .  Several  lataoases  are  used  tc  provide 
relations  for  testing  Join  operations.  For  testing,  it  is 
desirable  to  spread  tne  Join  operations  over  tne  two  dis>cs 
In  tne  system.  a  full  implementation  cf  tnis  desirable 
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Tuple  Lengtns  :  100,  200,  1000,  2000  Bytes 

Relation  logical  Sizes  :  500,  1000,  2500,  5000,  10000  Tuples 

Relation  Pnyslcal  Sizes  :  50  Kilobytes  to  20  megabytes 

Attributes  s  14  (  for  100  byte  tuples',  24  (ror 

otner  tuple  lengtns) 

Attribute  Types  :  Sequential  Integer,  Random  Integer, 

Collated  Alpnanumeric,  Blocks  Sets 


17.  GENERATING  SYNTHESIZED  DATA 


A.  A  PARAMETERIZED  RELATION  GENERATOR 

Tne  Relation  Generator  (RG)  is  a  parameterized  program 
for  *eneratin*  relations  for  a  lata  case.  PG  prompts  tae 
user  concerning  tae  characteristics  of  a  relation.  First 
tae  us°r  is  instructed  to  enter  tae  relation  nam*  and  size 
(i.e.,  tae  number  of  tuples).  Tnen,  tne  program  requests 
lata  about  eaca  attribute.  Tae  data  equestec  includes 
attribute  name,  value  type  (i.e.,  integer,  string,  etc.)  and 
distribution  of  tae  attribute  values.  Tne  relations  gene¬ 
rated  are  stored  la  ASCII  files  to  simplify  transfer  between 
systems. 

1 .  Cipipiliiles 

RG  contains  routines  to  generate  sequential  nurroers, 
ranlom  numbers  (eitner  uniquely  or  nonuniqueiv) f  and  cna- 
racter  strings  in  collated  oraer  (See  Appendix  E).  Tae  user 
mav  also  specify  a  file  wnicn  contains  a  set  of  values  for 
an  attribute  to  be  used  in  generating  attribute  values. 
Tais  set  is  called  a  'value-set'  and  tne  file  is  called  a 
value-set  Hie*  I*  is  produced  by  tne  utility  program, 
7alue-set  Generator  (described  below).  Tne  actual  ran^e  or 
values  from  tne  file  to  re  used  for  an  attribute  is  called 
tne  attribute's  domain.  Tne  user  specifies  tne  number  of 
values  from  tne  value-set  to  be  included  in  tne  attribute's 


domain.  It  Is  not  necessary  tflat  tne  domain  contain  all  the 
values  in  tne  value-set.  RG  requires  tne  user  to  define  tne 
distribution  of  tne  attribute  values.  Tne  dlstri  tutior.  is 
eltner  in  discrete  blocks  or  random  or  botn.  \  discrete 
distribution  in  wnlcn  tne  attribute  values  are  randomly  dis¬ 
tributed  may  be  created  by  sorting  a  relation  containing 
discrete  oiocics  on  a  random  number  attribute. 

2  •  Tn°  D^v® Ajj 2E£.ai 

a.  Tne  Development  Snvironment 

RG  is  written  in  IJJM  Pascal/VS,  running  under 
tne  VM/CMS  operating  system.  VM/CMS  is  an  Interactive, 
multiuser  operating  system.  Beacause  of  operating  system 
limitations,  RG  has  Dean  converted  to  a  ^VS  (tatcn)  process. 
Standard  Pascal  syntax  nas  been  utilized  as  mucn  as  possi¬ 
ble.  Pascal/VS  extensions  to  tne  language  nave  teen  used. 
Additionally,  some  of  tne  file  descriptor  information  is 
specific  to  tne  operating  systems. 

b.  Tne  Development  Process 

Tne  first  step  in  tne  development  of  tne  system 
is  tne  drafting  of  a  modular  framewors.  Persons  are  then 
assigned  to  develop  tne  different  modules  of  tne  program. 
Tne  different  modules  include  tne  main  program,  tne  main 
generator  module  and  tne  individual  value-type  generator 
modules.  Tae  individual  modules  produce  specific  types  of 
values  for  tne  attributes. 
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Tne  system  nas  been  developed  usinsr  modern  soft¬ 
ware  engineering  tecnniques.  Tne  different  modules  nave 
been  debugged  separately.  Program  narnesses,  wmcn  contain 
no  logic  except  to  invoice  a  procedure,  nave  beer  used  tc 
test  procedures  and  su oprooeiures .  module  stubs,  union 
simulate  tne  actions  usually  performed  by  procedures,  nave 
been  used  in  place  of  tne  procedures  to  test  tne  main  pro¬ 
gram  and  tne  main  generator  module.  Once  debugged,  tne 
modules  nave  been  integrated  *itn  tne  main  program. 

Tne  responsibility  for  generating  relations  na? 
been  assigned  to  one  person.  Additional  development  ot  tne 
system  Involved  several  Items  In  addition  to  debugging.  » 
utility  to  generate  value-set  files  nas  also  oeen  created. 
Tnus,  tne  otner  members  of  tne  team  nave  oeen  freed  to  wor:c 
on  otner  pnases  of  tne  project, 
c.  Design  Problems 

Two  major  problems  nave  been  enrounterea  in  tne 
preparation  of  RG.  Tne  first  problem  is  tne  sire  of  tne 
relations  to  be  generated.  In  tne  original  RG  lesie-n,  all 
of  tne  United  lists  of  attribute  values  reside  in  tne  prima¬ 
ry  memory  simultaneously.  Tne  size  of  tne  largest  relation 
tnat  nas  neen  generated  is  twenty  megabytes.  Tnis  requires 
twenty  megabytes  of  tne  virtual  memory  space  just  to  store 
tne  contents  of  tne  lists.  Additional  space  would  be  re¬ 
quired  for  tne  program  and  tne  overneaa  associated  witn 
linked  lists  (i.e.,  pointers  to  memory  locations).  Tnis 
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exceeds  ttie  virtual  memory  space  available  to  a  single  user 
under  7M/C*S. 

Tnis  problem  nas  been  partially  soivei  cy  acces¬ 
sing  sequential  files  as  a  substitute  for  tne  lln-cea  lists. 
Therefore  only  one  list  of  attribute  values  at  a  tire  is 
stored  in  tne  primary  memory.  However,  a  liricea  list  of 
some  of  the  longer  attributes  generated  requires  over  two 
megabytes  of  memory  just  for  tne  data,  without  considering 
tne  space  required  for  pointers. 

Tne  second  problem  concerns  tne  transportation 
of  tne  files  of  generated  relations  to  ar.otner  system. 
(Jnder  tne  7M/CMS  system  at  tne  Naval  Postgraduate  Scnool, 
eacn  user  is  allowed  a  limited  amount  of  file  space.  Tnis 
amount  is  mucn  too  small  to  hold  most  of  tne  relatiors  gene¬ 
rated.  Additional  space  Is  available  on  a  temporary  (i.e., 
one-day)  basis.  Also  important  is  the  fact  that  while 
7I"/C*S  files  can  be  offloaded  to  tape,  tney  are  stored  on 
tape  in  a  non-standard  format.  Tnere  is  no  utility  program 
to  transfer  7M/CMS  files  to  tape  in  standard  format.  There 
is  also  no  utility  program  to  excnange  files  between  tne 
tapes  of  7VCV!S  format  and  tae  tapes  of  M7S  format. 

It  is  apparent  tnat  Vtf/CVS  is  not  tne  ideal  en¬ 
vironment  in  which  to  run  the  system.  Therefore,  it  nas 
been  necessary  to  convert  tne  system  to  run  in  tnQ  y7S  en- 
Tne  M7S  system  writes  tapes  in  tne  standard 
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vi ronment 


format 


It  also  allows  tee  user  to  nave  a  mucr.  larger 


virtual  memory  space.  In  retrospect,  It  maxes  sense  to 
develop  tne  system  in  an  interactive  system  (i.e.,  VV/CVS'. 
Fast  turnaround  contributes  to  faster  program  development, 
and  tne  interactive  environment  maxes  debugging  easier. 

B.  A  MATRIX  OF  RELATIONS 

Tne  relations  generated  by  rg  are  designed  to  support 
experiments  over  a  ranee  of  relation  sizes  and  cneracteris- 
tics.  Tnese  sizes  and  cnaracteristics  are  selected  tc  allow 
maximum  flexibility  in  pursuing  experiments  witn  a  minimal 
number  of  relations  in  tne  test  database.  Tne  parameters 
discussed  below  are  tnose  of  tne  relations  produced  in  sup¬ 
port  of  tne  benonmaraine . 

1.  Standard  Template^ 

All  of  tne  relations  are  cnaracterizea  by  tne  same 
general  template.  This  template  is  snown  in  Table  2.  Four 
specific  templates  are  derived  from  tne  general  one.  Tnese 
templates  correspond  to  tne  four  tuple  lengtns  used  for 
testing  (i.e.,  1 00  bytes,  200  bytes,  1000  cytes  and  2000 
bytes).  Sacn  template  is  used  to  generate  tne  relations  of 
various  sizes  (500  -  i 0,000  tuples).  Tnus  most  of  tne  tests 
can  be  run  on  many  relations  by  cnanging  only  tne  relation 


name  (or  tne  values  of  tne  range  variable)  in  tne  queries. 


Table  2.  3enerai  Eililicn  T§TiEiat,g 


ley  -  a  sequential  number  to  be  stored  as  an  in¬ 

teger  field 

Mirror  -  a  sequential  number  (same  as  <eev)  to  ce 

storel  as  a  onararter  strins 

Random  -  a  random  number  to  be  stored  as  an  integer 

field 

9 

Random  Unique  -  a  unique  random  number  to  be  stored  as  an 

integer  field 

Collated  -  a  cnaracter  string  to  be  stored  in  aipnabetic  or¬ 

der 

9 

Letter  -  a  random  aipnabeticai  letter 

« 

Sets  -  olocics  of  values  from  value-set  files. 

*  not  used  in  some  templates 

*  multiple  attributes  depending  on  tne  tuple  lenetn 
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£ .  Flex!  cl liiy 

Tne  relations  are  designed  to  provide  flexibility  in 
testing.  Ideally  tne  tests  to  be  performed  are  rcnowr.  before 
designing  tne  relations.  However,  tne  results  from  sore  of 
tne  tests  may  suggest  a  need  for  additional  tests  wmcn  aa^e 
not  been  previously  considered.  Accordingly,  tne  relations 
are  designed  to  allow  the  design  of  additional  tests  witncut 
generating  more  relations. 

C.  THE  GENERATING  PROCEDURE 

Tne  generating  procedure  consists  of  tnree  pnases.  Tne 
first  pnase  consists  of  designing  experiments  and  tne  rela¬ 
tions  to  be  used  in  tnose  experiments.  After  tne  relations 
nave  been  designed,  tney  must  be  created  and  transported  to 
tne  testing  environment. 

Generating  relations  is  a  simple  process.  First  VG,  is 
usel  to  generate  any  necessary  value-set  files.  Tnen,  RG  is 
used  to  generate  relations.  RG  nas  been  expanded  to  produce 
a  description  file.  Tais  file  contains  tne  attribute  names 
and  cnaracteristics  of  tne  attribute  values  in  tne  relation. 
Tne  description  lists  botn  tne  format  of  tne  generated  file 
and  tne  format  of  tne  relation  as  it  is  to  oe  stereo  in  tne 
database. 

1.  Tne  Sensratfir 

Tne  generator  system  consists  of  two  -najor  programs, 
tne  Relation  Generator  (RG)  and  tne  Value-set  Generator 


(73).  Otner  programs  and  debugging  alas  may  pe  necessary, 
depending  on  tne  envi ronment (s )  in  wnicn  tne  system  is 
lmplementea . 

a.  Tne  Relation  3er.erator  (RG) 

RG  creates  a  relation  file  basea  on  input  from 
tne  user.  It  consists  of  four  types  of  modules:  tne  main 
program,  tne  main  generator  module,  tne  individual  generator 
modules,  and  tne  collating  module. 

Tne  “lain  Module  -  Tne  main  RG  module  contains 
very  simple  logic.  RG  prompts  tne  user  for  tne  cnaracteri- 
stics  of  tne  relation  being  generated.  First,  tne  name  and 
size  (in  tuples)  of  tne  relation  is  requested.  Tnen.  tne 
user  is  asiced  to  determine  tne  cnaracteris ti cs  of  tne  first 
attribute.  Tne  attribute  cnaracteristics  are  collected  in 
an  attribute  record  (See  Table  3).  After  tne  module  oDtains 
tne  necessary  attribute  cnaracteristics,  it  invoices  tne  maim 
generator  module. 

Tne  main  eenerator  module,  as  expined  intne  next 
section  produces  United  lists  of  attribute  values  and  re¬ 
turns  to  tne  main  RG  module.  RG  tnen  invokes  tne  collate 
module  wnicn  is  detailed  in  tne  sequel.  Tne  collate  module 
produces  tuples  by  concatenating  sets  of  attribute  values. 
After  tne  relation  nas  been  generated,  tne  user  is  given  tne 
option  of  yeneratine  anotner  relation  or  ending  tne  process. 

Tne  Main  Generator  Module  -  Tne  rain  generator 
module  is  invoiced  to  produce  eacn  set  of  attribute  values. 
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Taels.  3  The  Seri  err  a  of  an  AH  rebuts.  Rgcori 


Attribute  Name 
Attribute  Type 
String  Lengtn 
Lower  Bound 

Upper  Bound 
Generate  Mode 
Value  Set  Name 
Relative  Proportions 


assigned  attribute  name 
data  type  of  attribute  values 
used  for  string  types 
first  sequential  integer  and  io 
wer  bound  for  random  integers 
upper  bound  on  random  integers 
data-type  distribution 
value-set  file  name 
discrete  distribution  specifi¬ 
cation 

random  integers 


Seed 


Tne  cnaracteristics  of  an  attribute  are  osssea  to  tne  ^ocuie 


in  an  attribute  record.  Using  tnis  record,  tne  main  "onuie 
invokes  one  of  several  individual  generator  module?,  nepen- 
lin?  on  tne  cnaracteristics  of  tne  attrioute.  Tne 
individual  generator  module  produces  a  Unfed  list  cf  attri¬ 
bute  values  witn  tne  aesirea  type  ana  distribution,  and 
returns  tne  list  to  tne  main  generator  module.  Tne  main 
eenerator  module  opens  a  sequential  file,  writes  tr.e  attri¬ 
bute  values  into  tne  file,  closes  tne  file,  and  returns  to 
tne  main  RG  module.  There  are  therefore  several  sucn  files, 
'•cnown  as  attribute  files. 

Collate  Module  -  Tne  collate  module  acts  as  a 
collator.  It  pnyslcaliy  concatenates  strings  of  attrioute 
values  to  form  a  tuple.  It  is  invoiced  to  assimilate  ail  tne 
attribute  values  in  tne  attribute  files  into  a  file  of  tne 
relation.  Information  describing  tne  attributes  is  passed 
to  tne  collator  as  an  array  of  attribute  records.  Tne 
collator  first  opens  tne  relation  file,  and  ail  tne  attri¬ 
bute  files.  Tne  relation  is  generated  a  tuple  at  a  time. 
One  attribute  value  from  sacn  file  is  read.  Tne  values  are 
concatenated  to  produce  a  tuple.  Tne  tuple  is  tnen  written 
to  tne  relation  file.  Tne  collator  repeats  tnis  process 
until  ail  tne  tuples  nave  been  produced. 

b.  Tne  Value-set  Generator  (VG) 

Tne  Value-set  Generator  (VG)  Is  a  simple  utility 
for  setting  up  value-set  files  for  RG.  VG  asxs  for  tne  name 
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anl  size  (i.e.,  tne  number  of  values)  of  tne  valve-set  file 
to  be  created.  Tne  values  are  entered  individually  am 
stored  as  strings  in  a  random-access  file  for  use  cv  ?.& . 

2 .  Tne  Conversion  Problem 

Converting  tne  program  to  run  in  tne  baton  envi¬ 
ronment  involves  several  tasis.  Tnese  are  tne  conversion  of 
interactive  programs  to  baton  programs,  tne  submission  of 
jobs  to  tne  batcn  system,  and  development  of  tne  additional 
statements  required  to  use  of  tne  batcn  file  system.  Al- 
tnougn  tne  programs  nad  already  been  debugged  in  tne  VM/CMS 
environment,  extensive  lebuffeinsr  nas  been  necessary  after 
conversion  to  MVS. 

Conversion  of  pro*rams  from  VM/CMS  to  MVS  is 
not  a  simple  process.  A  virtual  card  dec*  is  created  in  a 
VM/CMS  file  wnicn  contains  tne  source  decic,  tne  input  data 
and  the  file  data  required  by  tne  MVS  system.  Tnis  file  is 
submitted  to  tne  batcn  aueue.  Tne  input  for  RG  (i.e.,  tne 
user's  replies)  are  in  tne  carl  decs  witn  tne  program. 

Altnougn  it  is  not  necessary,  tne  source  code 
wnicn  generated  tne  instructions  to  tne  user  for  tne  input 
nas  been  removed  for  tne  MVS  versions.  Tne  VM/cms  version 
nas  been  modified  to  create  a  file  wnicn  contains  tne  user's 
responses  to  tne  program's  prompts. 

Differences  between  tne  batcn  and  interactive 
systems  caused  tne  difficulty  in  program  conversion.  Tne 


batch  system,  MVS,  requires  much  more  in  the  way  cf  file 
parameter  specifications,  and  is  mucn  less  foreivinff  wnen 
error  conditions  eiist.  Tnere  are  some  error  conditions 
wnlcn  tne  user  can  not  foresee.  For  example,  tne  system  nay 
initially  allocate  space  for  a  relation  file  on  a  volume 
wnicn  does  not  nave  enough  free  space  to  cover  secondary 
allocations.  '<nen  tnis  happens  tne  program  is  snorted. 
However,  it  is  not  possible  for  the  user  to  specify  a  parti¬ 
cular  dlsir  (i.e.,  one  with  sufficient  space)  for  file 
storage.  For  tne  two  largest  relation  files  (fifteen  and 
twenty  megabytes),  it  has  been  necessary  to  write  each  or 
tne  relations  into  two  separate  files  on  tne  Datcn  system. 
Tfee  two  files  were  tnen  comfcinel  wnen  they  loaded  into  tne 
database. 

3 .  Trans porting  the  Relations  to  t&e  Igstpgd 

a.  Transporting  tne  Cata  to  tne  Eost 

Tne  transportation  of  tne  relations  to  tne  nost 
is  a  two-step  process.  Tne  first  step  is  tne  transfer  of 
tne  relation  files  from  tne  MVS  secondary  storage  to  tape. 
A  system  utility  is  used  to  accomplish  this,  Th®  tapes  are 
tnen  transported  to  tne  nost,  tne  UNIVAC  1100,  and  a  similar 
utility  program  is  used  to  load  lata  into  tne  nost  secondary 
storage.  Tne  nost  utility  program  translates  the  EBCDIC 
tape  files  into  ASCII  Usk  flies. 

b.  Loading  Data  Into  tne  Bacsend 

The  relations  are  loaded  into  tne  oacsend  using 


a  vendor-supplied  utility  called  a  translator.  Tnis  utility 


prompts  tfte  user  for  information  about  tne  source  file,  tne 
target  database,  and  tne  target  relation 

Tne  translator  utility  may  be  run  interactively 
or  vitn  file  input.  Tne  database  into  wcicn  tne  relation  is 
to  be  loaded  must  already  exist.  Tne  relation  into  wnicn 
data  is  loaded  may  or  mav  not  exist.  Database  name,  nost 
file  name,  and  relation  name  must  be  supplied.  Additional¬ 
ly,  for  eacn  attribute  tne  attribute  name,  lengtn  of  source 
(in  ASCII  characters),  and  type  of  value  to  be  stored  in  tne 
database  must  be  supplied. 


V.  GENERATING  TEST  PROGRAMS. 
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A.  TEE  TEST  PLAN 

Tne  general  test  plan  calls  for  several  different  types 
of  experiments.  Among  tnese  are  experiments  involving  only 
one  relation  (l.e.,  selections  and  projections)  and  experi¬ 
ments  involving  more  tnan  one  dataoase  (i.e.,  joins). 

1 •  Experiments  Involving  a  Single  Relation 

The  selection  and  projection  experiments  are  de¬ 
signed  to  measure  tne  system's  performance  in  retrieving 
data  from  a  single  relation.  Tne  response  times  measured 
are  tbe  sum  of  four  variaoies:  tne  time  to  process  a  query, 
tne  time  to  access  tne  data,  tne  time  to  process  tne  data, 
and  tne  time  to  return  tne  data.  Tne  time  to  process  tne 
query  is  defined  as  tne  time  to  parse  tne  query.  By  care¬ 
fully  constructing  sets  of  experiments*  tnese  variables  can 
be  estimated. 

Since  tne  time  to  process  a  query  is  so  small,  it 
may  oe  Ignored  or  combined  witn  overnead  for  most  experi¬ 
ments.  For  experiments  wnere  it  is  significant,  tne 
query-proccesslng  time  is  minimized  to  prevent  it  from  domi¬ 
nating  tne  time  measurement,  resulting  in  a  loss  of 
precision.  Tne  REM  110?  allows  tne  parse  tree  of  a  query  to 
be  stored  in  tne  database.  Tnis  capability  allows  tne 
replacement  of  tne  processing  time,  wnicn  is  dependent  on 
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tne  nost,  witn  tne  data  access  tine,  wnicn  is  aepencent  only 
on  tne  oacKend.  Tne  additional  data  access  time  is  tne  tire 
to  access  tne  command  in  storae-e.  Tnis  is  tne  same  for 
all  stored  commands. 

Tne  largest  variables  are  tne  easiest  to  measure 
witn  precision.  Tnererore,  tney  are  measured  first  ana  teen 
eliminated  to  measure  tne  smaller  variables. 

Tne  largest  variables  are  lively  to  ce  tnose  repre¬ 
senting  tne  time  to  access,  process  and  return  aata.  Tnese 
can  be  measured  witn  simple  retrieve  commands.  A  time 

i 

measurement  of  a  retrieve  wnicn  returns  all  tne  attribute 
values  of  tne  tuples  in  a  relation  includes  tne  times  of  ail 
of  tne  four  variables.  However,  a  time  measurement  usin^  an 
a«re*ate  function  (e.*.,  count,  wnicn  returns  a  single 
count  of  tne  tuples  meeting  tne  qualifications  of  tne  auery) 
eliminates  tne  time  to  return  tne  lata.  Tnus  tnis  function 
can  be  used  effectively  to  measure  tne  time  to  access  ana 
process  tne  data  (tuples),  i.e.,  two  of  tne  four  variables. 

l 

Furtner,  an  assumption  is  made  tnat  for  simple  com¬ 
mands  tne  processor  can  process  data  at  a  rate  wnlch  is 
faster  tnan  tne  rate  tnat  data  can  oe  brougnt  into  tne 
memory  for  processing.  Tnis  allows  tne  processing  time  to 
be  ienorel.  Tnerefore,  tne  measurements  reduce  to  a  measure 
of  tne  access  time. 

f 
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Havin*  quantified  tns  lareer  variaoies,  the  time  to 
process  data  mar  fie  investigated.  It  nas  Been  assumes  tnat 
tne  processing  time  is  not  significant  for  simple  commands. 
However,  if  tne  commands  are  made  more  complex,  tnen  tne 
processing  time  is  expected  to  increase.  Vltn  a  sufficient¬ 
ly  complex  command  wnicn  involves  a  small  data-access  time, 
tne  processing  time  may  cecome  significant.  Tnerefore, 
experiments  are  conducted  wnicn  minimize  data  access  out 
vary  in  complexity.  It  is  of  interest  to  determine  wner  or 
if  tne  processing  time  Becomes  measureaBie  and  significant . 

It  is  expected  tnat  projections  operations  will 
increase  tne  processing  time.  Tnerefore,  several  tests  are 
appropriate  for  testing  projections.  Tne  first  set  of  tests 
measure  tne  effect  of  projections  on  tne  processing  time. 
Tne  second  set  cnectcs  to  see  if  tne  processing  time  is 
affected  ry  tne  type(s)  of  attrioute  values  projected  (i.e., 
integer,  string).  Tne  tnird  set  of  tests  measures  tne 
performance  of  a  projection  on  all  of  tne  attributes  versus 
a  simple  'retrieve  all'  command. 

After  tne  time  Basic  vanaBies  nave  Been  estimated, 
otner  performance  factors  are  Investigated.  Tne  use  of 
indices  can  reduce  access  time.  By  reducing  tne  amount  of 
data  crougnt  into  tne  memory,  tne  processing  time  is  aiso 
reduced.  However,  tne  processing  time  will  Be  increased  due 
to  index  access  and  searcn.  Tnerefore,  for  some  relations, 
tne  use  of  indices  may  increase  tne  response  time.  Indexing 
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requires  a  specific  set  of  tests  to  measure  its  performance 
in  various  situations.  Tne  use  of  different  types  of  in¬ 
dices  (i.e.,  clustered,  con-clustered,  multiple  xeys,  et~.) 
must  also  oe  investigated.  An  expected  factor  in  index 
performance  is  tne  ratio  of  tne  index  size  (in  oiocks  of 
storage)  to  tnat  of  tne  relation. 

Strinsr  compression  (removal  of  trailing  spaces'  is  a 
factor  wnlcn  can  affect  tne  processing  time,  tne  access  time 
and  tne  return  time.  Tne  use  of  compression  can  reduce 
blocic  storage  dramatically.  Tnis,  in  turn,  reduces  tne 
access  time.  However,  it  may  require  more  time  to  process  a 
compressed  strin*  versus  a  non-compressed  one,  if  processing 
requires  expansion  of  tne  compressed  attribute.  if  expan¬ 
sion  is  not  required  for  processin*,  tnen  tne  nost  may  nave 
to  expand  it  for  proper  formatting.  How  expensive  fin  time) 
is  tnis?  Does  tnis  compensate  for  tne  reduction  in  tne 
responnse  time  resulting  from  returning  a  smaller  amount  of 
lata  (tne  compressed  strine)  to  tne  nost?. 

Otner  performance  factors  may  be  examined  eitner 
individually  or  witnin  otner  test  procedures.  An  example  or 
tnis  is  tne  use  of  different  types  of  attributes  (i.e.. 
integer  versus  string).  a  complete  series  of  tests  can  b® 
developed  to  test  tnis  issue  in  detail.  However,  it  is  also 
appropriate  to  investigate  tnis  area  in  conjunction  witr. 
processing  time  and  projections. 
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2.  Srper^ments  Involving  More  Tnan  One  Heia ti cr 

Operations  involving  more  tnan  one  relation  (i.e., 
joins)  are  affected  oy  tne  same  time  variaoies  as  tnosa 
involving  only  a  single  relation.  Initial  testing  snouil 
involve  only  two  relations. 

It  is  expected  tnat  tne  access  time  will  become 
lominant  for  join  operations.  Tnis  is  Decause  tne  same  lata 
may  nave  to  be  accessed  repeatedly.  Memory  size  cas  ar. 
effect  on  tne  amount  of  accessing  required  in  a  join  opera¬ 
tion.  If  memory  size  is  larse  enousrn  to  allow  coin 
relations  to  be  accessed  once  and  left  in  tne  memory,  men 
tne  processing  time  may  Become  significant.  In  tnis  circum¬ 
stance  ootn  tne  access  time  and  tne  processing  time  are 
expected  to  increase  proportionally  to  tne  relation  size. 
Tne  unknown  factor  is  tne  rate  at  wnicn  tne  processing  time 
increases.  However,  it  may  oe  tnat  neitner  relation  is 
small  enougn  to  be  neid  in  tne  memory  for  processing.  In 
tnis  case  mucn  accessing  must  oe  performed.  It  may  also  be 
of  Interest  to  examine  Join  performance  Detween  tnese  two 
extremes . 

Tne  join  snouid  be  designed  to  taxe  acvantage  of  any 
size  differential  between  tne  two  relations.  If  tne  smaller 
relation  can  be  completely  neld  in  tne  memory,  tnen  it  can 
be  accessei  once  and  orouent  into  tne  memory.  Tne  larger 
relation  can  also  be  accessed  just  once  as  it  is  nrougnt. 
into  tne  memory  as  a  stream.  If,  on  tne  otner  nand,  tne 
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larger  relation  Is  Drougnt  into  tne  memory,  it  must  c? 
brought  into  tne  memory  a  portion  at  a  time.  Tne  smaller 
relation  may  nave  to  Be  reaccessel  for  eaon  portion  of  tne 
larger  relation. 

It  is  important  to  examine  tne  performance  of  .loins 
botn  witn  and  wltnout  selection.  In  performing  tnese  tests, 
tne  strategy  of  tne  operations  should  ce  examinee  '•areruiiy. 
Tne  selection  snoull  ee  performed  before  tne  actual  join 
operation  to  minimize  tne  volume  of  data  bein?  joined. 

Anotner  area  of  interest  is  tne  effect  of  incex 
usage  on  Joins.  Performance  nere  is  expected  to  improve  as 
indicated  oy  tne  single  relation  index  experiments.  However 
tne  specific  results  may  suggest  tne  efficiency  witn  wnicn 
tne  Join  operation  nas  been  implemented. 

If  inequality  Joins  nave  been  implemented ,  perfor¬ 
mance  testing  saould  os  conducted  using  tnem.  If  tney  nave 
not  been  implemented,  it  may  be  valuable  to  tnow  if,  and 
with  vnat  difficulty,  tney  can  be  simulated. 

Having  experimented  tne  join  operations  involving 
two  relations,  experiments  operations  snoula  be  conducted 
using  larger  numbers  of  relations  in  one  Join  operation.  Ey 
investigating  tne  performance  on  multiple  Join  relations,  it 
may  be  possible  to  isolate  a  fixed  overnead  for  ail  tne 
initial  Joins. 


3.  A  £i£xiii£  Jest  Plan 


A  general  test  plan  snouia  ce  developed  before  any 
or  tne  experiments  are  designed.  It  saouid  be  flexible  to 
enanle  testing  to  follow  different  patns  of  Discovery.  It 
is  expected  tnat  tne  results  of  some  experiments  may  suggest 
otaer  experiments.  Time  must  be  ailoted  for  tne  expansion 
of  any  test  set. 

However,  it  must  also  ensure  tnat  tne  a  sufficient 
range  of  data  is  ootainsd.  Tne  tests  must  cover  tne  univer¬ 
sal  operations  (i.e.,  tnose  expected  of  any  DB^S ) .  Among 
tne  universal  operations,  Known  bottlenecks  and  creaKpoints 
are  specifically  tested.  It  snouid  also  investigate  any 
specific  strengtns,  weaknesses  or  idiosyncrasies  of  tne 
DBMS. 

B.  MEASUREMENT  TOOLS 

Tne  response-time  measurements  in  tnese  experiments  were 
taken  from  tfte  oackend-macnine  clock.  Tnis  clock  nas  a 
resolution  of  1/60  second  and  an  accuracy  witnin  l/50-tn  of 
a  second.  The  response  time  of  tne  oackena  nacnioe  on  small 
relations  is  dominated  by  communications  overnead.  Tne 
minimum  response  time  is  about  one  second.  So,  of  tne  tests 
conducted,  tne  l/S0-second  interval  is  sufficiently 


accurate 


However,  if  toe  overneai  can  oe  reduced,  a  more  precise 
measuring  device  is  required.  Most  mainframe  operating 
systems  provide  a  does  witn  a  resolution  in  microseconds. 
TMs  is  not  available  in  tne  cacicend  macnine. 

C.  QUERY  SCRIPTS  VERStJS  PROGRAMS 

Two  methods  exist  for  performing  oencnmaric  experiments. 
Tnese  metnods  involve  tne  use  of  query  scripts  anc  programs. 
Tne  first  of  tnese  simulates  an  Interactive  session  acces¬ 
sing  tne  database.  Tne  actual  terminal  input  is  prepares 
anead  of  time  and  stored  in  a  'run-stream'  file,  Known  as  a 
query  script.  Tne  nost  operating  system  can  be  instructed 
to  obtain  its  input  from  a  file  instead  of  via  tne  terminal. 
Tnus  a  series  of  tests  can  be  collected  togetner  in  a 
script.  Additionally  tne  output  can  be  redirected  to  a 
file,  removing  tne  overnead  in  communicating  witn  a 
terminal. 

Tne  use  of  batcn  programs  involves  mucn  more  of  tne 
programmer's  time  m  tne  development  and  debugging  of  tne 
program.  Development  of  batcn  programs  also  represent  a 
larger  drain  on  tne  nost's  resources.  T.nis  factor  coun 
severely  affect  testing  at  many  installations. 

Since  qus ries  must  ce  interpreted  wnetner  tney  cone  from 
a  batcn  job  or  a  script,  tne  use  of  eaten  programming  did 
not  offer  tne  advantages  of  bypassing  tne  query  processor. 
Tnerefore,  tnere  is  some  question  wnetner  or  not  a  eaten 


program  would  provide  superior  performance  results.  Tris 
question  and  tne  ease  of  development  of  query  seniors 
suggest  tnat  tne  use  of  query  scripts  is  tne  aesirea  ret"oa. 
If  oaten  programming  offers  a  significant  performance  im¬ 
provement,  additional  testing  must  oe  performed  using  eaten 
Jons.  Here  it  would  be  wise  to  run  a  complete  oattery  of 
tests  in  tne  interactive  environment,  followed  oy  a  sunset 
of  tnese  tests  in  tne  oaten  environment.  Tnis  suDset  sr.ouii 
ne  designed  to  test  areas  wnere  tne  oaten  process  may  nave 
its  most  impact  (i.e.,  tne  data  return  time). 

D.  INTERPRETING  THE  DATA 

Tne  interpretation  of  data  is  a  very  important  part  of 
tne  testing  pnase.  Tnere  are  two  reasons  for  tnis.  First, 
conclusions  cannot  be  drawn  from  raw  lata.  Second,  Timely 
interpretation  enables  tne  persons  conducting  tne  experi¬ 
ments  to  analyze  tne  results  and  identify  furtner  testing. 

A  collection  of  raw  data  is  very  nard  to  interpret. 
Tnerefore,  any  results  obtained  snould  oe  grapnel  immediate¬ 
ly.  (Irapning  tne  results  immediately  allows  rapid 
identification  of  errors  and  unexpected  results.  Related 
results  snould  also  be  prapnea  toeetner.  For  example,  all 
tne  results  from  a  ^uery  applied  to  relations  of  different 
tuple  len?tn  and  relation  size  snould  be  erapned  to^etner. 

Once  tne  raw  data  is  analyzed,  tne  grapns  may  be  re¬ 
fined.  Tne  grapn  axes  may  be  varied  as  appropriate.  For 
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exanple,  tne  response  tine  nay  De  grapned  against  tne 
lengtn ,  against  tne  relation  size  (in  tuples  or  tne  n 
of  blocirs  of  tne  storage  space  occupied)  and  asrainst 
quantity  of  lata  returned  to  tne  user. 
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VI.  CONCLUSIONS 


A.  RESULTS 

Tne  results  obtained  from  testing  several  configurations 
of  a  relational  database  macftine  nave  provided  a  oasis  for 
developing  a  general  set  of  benchmark  tests  for  relational 
database  machines.  Tne  benchmarking  tests  nave  been  mostly 
machine  independent.  Altnougn  a  testing  metnocoiogy  is 
provided  herein  with  enough  results  on  certain  configura¬ 
tions,  additional  testing  is  necessary.  This  testing  snouli 
be  performed  on  otner  DBMSs,  preferably  witn  different  cha¬ 
racteristics,  to  ensure  that  tne  test  is  complete  and  not 
macnine-specific.  Tne  results  of  testing  selection  ann 
projection  operations  are  described  in  L4j .  Results  from 
performing  tests  on  join  operations  are  described  in  [bj . 

1 •  General  Result^ 

The  response  time  nas  been  sn  wn  to  be  proportional 
to  the  time  required  to  access  tne  lata.  This,  in  turn,  has 
been  shown  to  be  proportional  to  pnysical  size  of  the  data¬ 
base.  Methods  used  to  reduce  tne  amount  of  data  to  ce 
brought  into  tne  memory  for  processing  ( suon  as  indexing  ana 
string  compression)  improve  tne  response  time. 

The  response  time  is  also  proportional  to  tne  amount 
of  data  returned  to  tne  user.  In  tne  case  of  tne  °.LM  lltftf, 
the  time  reauired  to  return  tne  lata  is  tne  largest 
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component  of  tne  total  response  time.  If  tne  necessarv 
information  is  obtained  via  aggregate  functions,  tne  re¬ 
sponse  time  is  greatly  improved.  It  is  not  possible  to 
determine  now  muon  of  tne  response  time  is  due  to  tne  racx- 
end  macnine  and  now  muon  is  due  to  tne  cost.  Fowever, 
loading  tne  nost  definitely  degrades  tne  response  time.  An 
analysis  of  tne  response  tire  under  various  load  ''or.di t i ons 
in  tne  nost  may  lead  to  a  distinction  of  tne  nost  response 
time  vs.  tne  bacsend  response  time. 

Tie  time  required  to  process  queries  and  tne  time 
reauired  to  process  data  in  tne  memory  are  reiatlvelv  small 
for  the  RD^  11  Tnis  may  not  oe  true  for  otner  systems. 
Tnerefore,  it  is  imperative  tnat  tnese  areas  oe  carefuixy 
examined  wnen  adapting  tne  proposed  tests  to  systems  witn 
different  arcni tectures . 

Tne  results  of  tne  experiments  snow  tnat  tbKSs  do 
have  characteristics  wnicn  may  oe  measured.  A  well-con¬ 
ceived  series  of  tests  can  measure  an  installation's 
performance,  and  gain  an  indication  of  its  performance  am 
its  'personality.'  Tnese  tests  can  oe  used  to  compare  D3!*Ss 
against  eacn  otner.  For  tne  D3MS  implementor,  tr.e  tests 
also  provide  a  metnod  of  determining  poorly  implemented 
parts  of  tne  system. 


2  •  ISiSilSa  °  2S.U1IS. 

Tne  experiments  wnicn  nave  been  performed  nave 
supported  two  different  types  of  stuay.  Tne  first  is  t ne 
actual  measurement  of  tne  oactcena  macnine's  performar.ee 
(albeit,  witn  lignt  load  and  few  configurations).  Tne  RDM 
1100  provides  a  comprenensi ve  (aitnougn  uncompiete)  rela¬ 
tional  model  wfiicn  successfully  offloads  DiMS  tasirs  from  tne 
nost.  Since  evaluation  of  tne  macnine  was  conducted  simul¬ 
taneously  witn  tne  research,  tne  tasu  of  evaluating  it  nas 
been  accompiisned .  Some  areas  tnat  nave  not  been  fully 
investigated  are  due  to  tne  lacic  of  time.  Other  areas  that 
nave  not  been  fully  investogated  are  due  to  incomplete 
implementation.  As  an  example  of  tnese  areas,  tne  use  of  ALL 
in  a  retrieve  s  is  contigent  upon  tne  number  of  attributes. 
At  one  point,  tne  use  of  ALL  on  a  relation  witn  a  lar<?e 
number  of  attributes  results  in  only  an  error  message.  After 
installation  of  tne  accelerator,  tne  use  of  ALL  naits  tne 
command.  After  tne  accelerator  is  removed,  tne  prooiem  of 
halting  persists.  Another  deficiency  noted  nas  teen  tne 
inability  to  perform  an  inequality  join. 

E.  A  RELATIONAL  BENCHMARKING  METHODOLOGY 

Tne  proposed  set  of  benenmars  tests  nas  four  pnases. 
The  first  phase  consists  of  preliminary  tests  designed  to 
identify  tne  best  method  of  measuring  tne  system's  response 
time.  Tne  second  pnase  involves  isolating  tne  different 
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components  of  the  response  time.  Tne  thirl  phase  investi¬ 
gates  the  system  response  in  specific  areas.  The  feurta 
phases  verifies  tne  results  ootainel  durin*  the  phases  two 
and  three. 

1  Ong  -  Measurement  Methods 

Most  systems  have  at  least  one  mechanism  whi^n  pro¬ 
vides  a  time  measurement.  Initial  testing  is  designee  tc 
identify  tne  one  which  optimize  tne  precision  attained  ver¬ 
sus  the  ease  of  ootalning  that  time.  Once  the  measurement 
method  nas  been  cnosen,  it  is  cnecsen  to  ensure  tnat  it  is 
accurate  enouffh  to  provide  tne  necessary  precision.  It  is 
also  necessary  to  ensure  tnat  tne  overneaa  involved  m 
retrieving  the  time  loes  not  reduce  tne  precision  of  tne 
measurements  being  taicen. 

If  tne  necessary  precision  is  not  readily  available, 
then  techniques  are  available  to  increase  tne  preci  ion  of 
tne  results.  Tnese  tecnnlques  involve  performing  an  opera¬ 
tion  several  times  anl  calculating  an  average.  Tne  tech¬ 
niques  selected  must  be  reviewed  for  side  effects.  Tne  D£MS 
may  have  tne  capability  of  internally  optimzing  performance. 
For  example,  tne  order  in  which  tne  queries  are  submitted  to 
the  DBMS  may  allow  tne  DBMS  cacne  memory  management  to 
reduce  disic  access. 

In  the  case  of  the  RDM  lltfe,  two  different  methods 
of  measuring  time  could  nave  been  used.  Tne  first  metnod  is 
to  obtain  a  time  stamp  from  tne  nost  operating  system. 


Although  It  may  nave  provided  sufficient  precision,  it  nas 
not  eeen  investigated  because  of  tne  otner  methods 
available.  Tne  second  method  is  a  tine  stamp  avaiiaoie  from 
tne  I DM .  A  built-in  function  supplies  an  elapsed  time 
measurement  intervals  of  one-sixtietn  of  a  second.  Tnis 
provides  sufficient  precision  for  tne  measurements.  Since 
tne  elapsed  time  is  a  sufficient  measurement,  tne  more 
precise  measurement  nas  not  been  usee. 

V,  Pnase  Two  3  Component  Isolation 

Once  an  adequate  method  for  measuring  time  has  been 
verified,  it  is  used  to  measure  tne  performance  in  several 
specific  areas.  These  areas  are  tne  four  components  wnicn 
are  involved  in  ail  queries:  tne  time  of  process  tne  yuerv 
(i.e.,  parse  it),  the  time  to  access  the  data  in  the  data¬ 
base,  the  time  to  process  the  data  in  the  memory,  and  tne 
time  to  return  tne  requested.  Tnese  components  may  oe 
considered  tne  DBMS'S  primitive  operations.  Tnese  primi¬ 
tives  do  not  tase  advantage  of  any  metnoas  used  to  improve 
tne  response  time  of  a  given  query.  Tney  merely  measure  tne 
performance  of  tne  hardware  and  software  in  performing  spe¬ 
cific  functions.  It  nas  been  stated  tnat  a  performance 
measurement  of  some  aspects  of  a  DB*S  is  reaiiy  a  measure¬ 
ment  of  the  operatine  system.  The  operating  system  does 
effect  DBMS  response.  However,  in  tne  case  of  a  oacicenc 
machine,  this  effect  is  minimal  for  some  operations,  tfnile 
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tills  issue  may  be  debated 


it  is  not  of  Interest  to  tne 


user.  Tne  user  is  not  interested  in  tne  reasons  wny  a 
system  responds  poorly.  fie  is  interested  only  in  tne  fact 
tnat  a  system  performs  properly  and  tne  fact  tnat  tne  sy¬ 
stem's  performance  is  better  (or  worse)  tnan  tnat  of  anotner 
system.  fie  is  most  Interested  in  tne  possibility  of  obtai¬ 
ning  a  quieter  response  time  on  nis  application. 

Tne  system  primitives  are  measured  by  a  set  of 
queries  vnicn  isolate  different  aspects  of  tne  response 
time.  One  set  of  queries  is  designed  to  return  tne  same 
amount  of  data  from  relations  witn  tne  same  numDer  cf  tu¬ 
ples,  but  navlng  different  tuple  sizes.  Once  a  tuple  is  in 
tne  memory,  it  taKes  tne  same  amount  of  time  to  project  one 
attribute  from  a  set  of  1450-byte  tuples  as  from  a  set  of 
2000-byte  tuples.  Tne  difference  in  tne  response  time  for 
tne  two  queries  is  due  only  to  tne  time  necessary  to  brine 
tne  tuple  into  tne  memory.  Tne  times  required  to  process 
tne  query,  to  process  tne  data  and  to  return  tne  data  are 
tne  same. 

Tne  second  set  of  queries  is  designed  to  measure  tne 
time  required  to  return  tne  data  to  tne  user.  Tnese  queries 
return  a  different  amount  of  data  (in  bytes)  from  projection 
operations  on  tne  same  number  of  attributes  in  tne  same 
format  (l.e.,  strings,  etc.)  in  relations  wnicn  are  of  tne 
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sane  pnysicai  size.  Tnese  restrictions  assure  tact  tne 
access  time  Is  tne  same,  tne  processing  tine  is  tne  same, 
anl  tne  query  processing  time  is  tne  sane. 

Tne  tairl  set  of  queries  is  designed  to  isolate  lata 
processing  time.  In  tnis  set,  tne  queries  return  tne  same 
amount  of  lata  from  relations  of  tne  same  pnysi-al  size 
(i.e.,  identical  storage  requirements)  but  naving  a  differ¬ 
ent  number  of  tuples.  Tnis  provides  a  measurement  of  tne 
processing  required  relative  to  tne  number  of  tuples  pro¬ 
cessed.  Tne  query  processing  time,  tne  lata  access  time, 
and  tne  data  return  time  are  tne  same. 

Tne  fourtn  set  of  queries  provides  a  measurement  of 
query  processing  time.  For  operations  on  relations  of  any 
significant  size,  tnis  is  nard  r,o  measure.  Even  on  small 
relations,  it  may  not  be  significant  compared  to  simple 
system  overnead.  Tnis  set  of  queries  is  more  complex  tnan 
tne  provious  sets.  Tne  queries  are  constructed  to  allow  tne 
effects  of  tne  time  elements  (i.e.,  tne  tnree  just  measured' 
to  be  subtracted  from  tne  measurements,  leaving  only  tne 
query  processing  time.  Considering  tne  difficulty  in  ob¬ 
taining  a  precise  measurement  of  tne  query  processing  time, 
it  may  not  be  wortnwniie  to  determine  tnis  value  because  of 
its  small  size. 

Tne  previous  discussion  indicates  tnat  tne  query 
sets  are  independent.  However,  witn  proper  planning  tne 
query  sets  may  be  combined  witn  equivalent  results.  In  tne 
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graph  shown  in  Figure  3,  one  set  of  experiments  provides  a 


measurement  of  lata  access  times  ana  lata  return  tines.  Tr.e 
set  also  isolates  tne  constant  query  overnsai  'wr.icr. 
includes  t&e  query  processing  time). 

Figure  3  represents  tne  response  time  of  two  que¬ 
ries.  One  query  selects  five  percent  of  tne  tuples  ana 
returns  ail  of  tne  attrinute  fields  from  eacn  tuple.  Tne 
second  query  is  identical  except  tnat  it  selects  ten  percent 
of  tne  tuples.  Tne  queries  are  ootn  run  against  relations 
witn  100-oyte  tuples.  T&e  relations  vary  in  size  from  500 
tuples  to  10,000  tuples.  Point  A  on  tne  grapn  represents 
tne  five  percent  selection  on  10,000  tuples.  Point  t  repre¬ 
sents  tne  ten  percent  selection  on  5000  tuples.  Since  eacn 
of  tnese  queries  returns  500  tuples,  tae  time  to  return  tae 
data  is  tne  same.  T&e  over&eal  associated  witn  eacn  Query, 
including  query  processing  time,  is  t&e  same.  Tnerefore, 
the  difference  oetween  tne  response  times  represented  oy 
Points  A  and  B  is  tne  difference  is  tne  access  time  and  t&e 
processing  time  of  t&e  Queries.  Point  A  represents  a  re¬ 
trieve  on  10,000  tuples,  wnicn  is  500  oIocks  of  lisir 
storage.  Point  B  represents  a  retrieve  on  5000  tuples,  or 
250  diss  Dloctcs.  Assuming  tnat  processing  time  for  tnese 
queries  is  insignificant  relative  to  tne  access  time,  fore, 
tne  difference  in  tne  two  response  times  is  t&e  time  to 
access  250  disk  dIocjes. 
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•  V 


FIGURE  3  -  RESPONSE  TINE  VS •  SIZE  OF  RETURNED  DATA 


he  overhead  for  all  the  queries  shown  on  t re  graph 


is  the  same  and  is  represented  oy  the  common  intercept  cf 
the  vertical  axis.  If  the  time  represented  oy  Point  t  is 
adjusted  for  the  overhead  and  tne  time  to  access  252  dio^ks, 
tnen  tae  result  is  tne  time  to  return  502  122-byte  tuples. 
Therefore,  the  use  of  one  query  set  has  identified  rates  for 
accessing  data  (in  blocks  per  second)  and  returning  data  fin 
bytes  per  second) 

3.  Phase  Three  -  S intern  Response 

After  tne  time  elements  nave  Deen  measured,  a  set  of 
queries  are  performed  which  measure  the  effect  of  methods 
used  to  improve  tne  system  response.  An  example  of  tnis  is 
tne  use  of  indexes.  Theoretically,  tne  use  of  indexes 
should  improve  system  performance  Dy  decreasing  tne  amount 
of  data  accessed.  However,  tne  index  must  be  accessed  ana 
processed.  Areas  of  interest  here  involve  determining  at 
wnat  point,  if  any,  does  the  use  of  indexes  become  impor¬ 
tant.  Therefore,  performance  on  indexed  relations  is 
measured  over  a  wide  range.  Wnat  type  of  index  (i.e.t  clu¬ 
stered  or  non-ciusterea )  provides  tne  best  performance  and 
wnat  are  tne  trade-offs?  rfnat  scope  of  indices  (i.e.,  one 
attribute,  two,  or  more  )  provides  tne  test  performance? 
Tne  latter  question  may  be  one  dependent  on  the  application. 
In  testing  tne  RDM  1120,  it  nas  been  notei  that,  if  tne 
index  is  defined  wnen  tne  relation  is  being  created,  tnen 


tne  size  of  a  relation  witn  a  clustered  index  is  larger  then 


tne  size  of  tne  same  relation  if  tne  index  is  defined  after 
tne  data  nas  been  entered  into  tne  relation.  Tr.is  is  ee- 
cause  tne  loading  algoritnm  assumes  a  normal  distribution  of 
Key  values,  wniie  tne  data  is  in  Key  sequence,  data  loalei 
nas  been  generated  already  sorted. 

Additional  testing  snould  be  performed  to  get  a 
'feel'  of  tne  system.  3 y  becoming  familiar  witn  tne  sy¬ 
stem's  capabilities,  tne  testing  personnel  snould  oe  able  to 
determine  interesting  lines  of  experimentation.  Areas  of 
interest  include  tne  overneaa  associated  witn  projection 
operations,  tne  use  of  string  compression  tecnni^ues,  and 
tne  efficiency  of  join  operations  (in  different  available 
memory  configurations,  wnen  available). 

4.  Pna«,e  F^ur  =  Verification 

Tne  last  pnase  tares  place  after  tne  otner  tests 
nave  been  reviewed  and  grapned.  Analysis  of  tne  previous 
tests  snould  provide  some  meaningful  results  about  system 
performance  in  general,  and  m  particular  areas.  Tne  veri¬ 
fication  pnase  serves  to  perform  tests  wnicn  verify  or 
disprove  tne  analysis  of  tne  previous  tests.  It  also  pro¬ 
vides  an  opportunity  to  reoo  any  tests  wnicn  appear  errone¬ 
ous  or  suspicious.  In  tnls  pnase,  additional  tests  may  tars 
advantage  of  tne  flexibility  designed  into  tne  syntnetir 
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C.  SUMMARY 


Investigation  of  tne  performance  of  several  configura¬ 
tions  of  a  bactend  relational  database  macnine  nas  provider 
considerable  insight  into  wnat  may  be  a  sound  basis  for 
general  performance  testing  on  relational  DBMSs.  In  tnis 
tftesis,  a  methodology  has  been  laid  out  and  tne  initial 
pnases  to  be  tasen  in  tnat  metnodoiogy  nave  oeen  lefined.  A 
complete  framework  for  subsequent  pnases  nas  net  beer  fully 
levelopel,  out  their  contents  nave  been  discussed.  While 
tne  tests  described  relate  to  a  specific  series  of  relation¬ 
al  database  macnines,  tne  basic  metnoaoiogy  may  apply  to 
relational  database  machines. 
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APPENDIX  A 
ID*  S£S  t£.T  lAll£S 

System  Tables  , 

1.  Databases  -  catalog  of  databases  in  tne  system 

2.  DISKS  -  list  of  diSKS  Known  to  system 

3.  Lock  -  used  by  IDf*  for  concurrency  control 

4.  Configure  -  information  about  serial  ana  parallel  in¬ 
terfaces,  cnecKpoint  interval 

5.  DBinstat  -  information  about  current  activity  in  tne 
IDM 

Rataoasg.  2l£l£l 

1.  Relation  -  catalog  of  all  objects  (relation,  view, 
stored  command)  in  tne  database 

2.  Attribute  -  catalog  of  eacn  attribute  of  eacn  relation 

3.  Indices  -  catalog  of  indices  tnat  exist  in  tne  database 

4.  Protect  -  cataioe  of  protection  information  in  tne  da¬ 

tabase 

5.  Query  -  stored  commands  and  view 

5.  Crossreference  -  catalog  of  dependencies  among  relations, 
views  and  stored  commands 

7.  Transact  -  transaction  lo?*in*  relation 

8.  Users  -  mapping  of  user  and  group  names  to  user  ID 

9.  Host_Users  -  mapping  from  .dost  ID  and  user  ID  to  IDM  ID 

IP.  BlocKalloc  -  catalog  of  diss  olocKs 

11.  DisK_Usa<re  -  database  allocation 

12.  Baton  -  temporary  transaction  logging  relation 

13.  Descriptions  -  user  definable  descriptions 
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APPENDIX  B 


Database  Generator  Program 

(CMS 

PascalVS) 
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VAR 

It  Jt  K.  L  :  INTEGER  * 

BUFF_STPING  :  STRING  (STR_LENIJ 

DESCRIPTION  :  TEXT) 
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alala^k:  ■-' 


FUNCTION  KO.MCRE  :  BOOLEAN} 

4*  PROMPTS  THE  USER  TC  SEE  IF  HE  WANTS  TO  GENERATE  ANOTHER  VALUESET 
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